optionally use MADV_GUARD_INSTALL for large allocation guard pages by thomasbuilds · Pull Request #341 · GrapheneOS/hardened_malloc

thomasbuilds · 2026-05-29T20:45:59Z

Addresses the high-VMA-count concern from KERNEL_FEATURE_WISHLIST.md (see #258). MADV_GUARD_INSTALL (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate PROT_NONE VMAs.

Change

Adds CONFIG_GUARD_PAGES_USE_MADVISE (default false). When enabled, guard regions for large allocations are installed with MADV_GUARD_INSTALL inside one read-write mapping rather than carved out as separate PROT_NONE mappings, keeping each large allocation to a single VMA instead of three. This is applied in allocate_pages(), allocate_pages_aligned(), the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations. It also holds across the mremap growth path: guard markers move with the mapping and the moved body merges into the never-faulted destination fragments (verified on 6.17). For aligned allocations the installed guards sit exactly adjacent to the usable region, giving the clean [guard][usable][guard] layout discussed in #350.
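To make the layout concrete, here is a minimal sketch of the single-mapping scheme with the best-effort fallback. It is not the PR's code: `map_with_guards` is a hypothetical helper, the fallback is simplified to two `mprotect` calls on the ends, and the fallback `#define` assumes the value 102 from the Linux UAPI headers for toolchains whose headers predate 6.13.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Older libc headers may lack the constant; 102 is the Linux UAPI value. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/*
 * Hypothetical helper: map [guard][usable][guard] as one read-write mapping,
 * then turn both ends into guard regions. Sizes must be page-size multiples.
 * Returns the usable region, or NULL if neither scheme could be set up.
 */
static void *map_with_guards(size_t usable_size, size_t guard_size) {
    int saved_errno = errno;
    size_t total = usable_size + 2 * guard_size;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    char *usable = base + guard_size;
    char *upper = usable + usable_size;

    /* mmap + 2x madvise: the guards live in the page tables, one VMA total. */
    if (madvise(base, guard_size, MADV_GUARD_INSTALL) == 0 &&
        madvise(upper, guard_size, MADV_GUARD_INSTALL) == 0) {
        return usable;
    }

    /* Best-effort fallback: PROT_NONE guards, which split the VMA in three. */
    if (mprotect(base, guard_size, PROT_NONE) != 0 ||
        mprotect(upper, guard_size, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    errno = saved_errno; /* the allocation succeeded, so hide the madvise error */
    return usable;
}
```

On a kernel without `MADV_GUARD_INSTALL` the first `madvise` fails with `EINVAL` and the sketch silently takes the `PROT_NONE` path, so the caller sees the same pointer either way.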

Syscall cost: MADV_GUARD_INSTALL zaps any existing pages in the range, so no separate MADV_DONTNEED is needed and the quarantine and the shrink path stay at one syscall each (one madvise instead of one mmap). Only allocation pays one extra syscall (mmap + 2x madvise instead of mmap + mprotect).

Kernel support is probed once at runtime on a fresh mapping and cached. Guard installation is best-effort: any madvise failure falls back to the existing PROT_NONE scheme rather than failing the allocation, preserving errno. MADV_GUARD_INSTALL returns EINVAL on VM_LOCKED mappings; that resets the cached state so the next allocation re-probes: under mlockall(MCL_FUTURE) the probe mapping is itself locked and the feature latches off rather than being retried per allocation, while freeing a one-off mlock'd allocation only loses that single call. Under CONFIG_LABEL_MEMORY the quarantined region is labeled as a whole so PR_SET_VMA_ANON_NAME does not split the single VMA back into three.
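The probe-and-cache behavior described above could look roughly like the following sketch. The names `guard_state`, `guard_install_supported` and `guard_install_failed` are hypothetical and the three-state encoding is an assumption; the PR only states that a single atomic flag is cached and reset on `EINVAL`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux UAPI value, for older headers */
#endif

/* Hypothetical cached state: 0 = not probed, 1 = supported, -1 = unsupported. */
static _Atomic int guard_state = 0;

/* Probe once on a fresh one-page mapping; the caller's errno is preserved. */
static bool guard_install_supported(void) {
    int state = atomic_load_explicit(&guard_state, memory_order_relaxed);
    if (state != 0) {
        return state > 0;
    }
    int saved_errno = errno;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        errno = saved_errno;
        return false; /* could not probe; leave the state unset and retry later */
    }
    /* Under mlockall(MCL_FUTURE) this mapping is locked, so the probe fails. */
    bool ok = madvise(p, page, MADV_GUARD_INSTALL) == 0;
    munmap(p, page);
    atomic_store_explicit(&guard_state, ok ? 1 : -1, memory_order_relaxed);
    errno = saved_errno;
    return ok;
}

/* On a failed real install: EINVAL may mean VM_LOCKED, so force a re-probe. */
static void guard_install_failed(int err) {
    if (err == EINVAL) {
        atomic_store_explicit(&guard_state, 0, memory_order_relaxed);
    }
}
```

Relaxed ordering is enough in this sketch because the flag guards no other data: a thread that reads a stale value just takes the fallback path or probes again.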

One sharp edge is documented rather than fixed: guard install is not atomic, so if it fails partway through the realloc-shrink path and the PROT_NONE fallback also fails (two ENOMEMs back to back), part of the discarded tail may be left guarded or zapped while realloc returns NULL. Failing loudly on a later access is preferred over a MADV_GUARD_REMOVE recovery that would silently expose zeroed pages.
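As an illustration of the shrink path and its fallback, the sketch below discards the tail of a mapping in place. `discard_tail` is a hypothetical name, not a function from the PR, and the double-failure case simply returns -1 here, which corresponds to the situation where realloc would return NULL with the tail in an unspecified guarded or zapped state.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux UAPI value, for older headers */
#endif

/*
 * Hypothetical helper: discard [new_size, old_size) of the mapping at base.
 * Both sizes must be page-size multiples with new_size <= old_size.
 * Returns 0 on success, -1 if the guard install and the fallback both fail.
 */
static int discard_tail(char *base, size_t old_size, size_t new_size) {
    int saved_errno = errno;
    char *tail = base + new_size;
    size_t len = old_size - new_size;
    if (len == 0) {
        return 0;
    }
    /* One syscall: zaps the tail's pages and marks them as guards. */
    if (madvise(tail, len, MADV_GUARD_INSTALL) == 0) {
        return 0;
    }
    /* Fallback, also one syscall: replace the tail with a PROT_NONE mapping. */
    if (mmap(tail, len, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
        return -1; /* errno is left as set by the failed mmap */
    }
    errno = saved_errno;
    return 0;
}
```

Either path leaves the retained prefix untouched, and any later access to the tail faults rather than reading stale or zeroed data.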

Why off by default

In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured. Resident memory and total address space are unchanged (RLIMIT_AS unaffected), but private-writable commit charge grows, which regresses strict overcommit (vm.overcommit_memory=2):

- live allocations: commit grows by the guard size (~260 MiB for 2000 256 KiB allocations);
- the quarantine dominates: quarantined regions stay committed (~1.9 GiB at default quarantine settings under sustained 1 MiB churn, vs 0 for the PROT_NONE scheme). RSS is still released because guard install zaps the pages (measured at +184 KiB resident after 2560 quarantined 1 MiB frees).

There is also a throughput cost on allocation-rate-bound workloads: guard installation writes a page-table marker for every 4 KiB page and allocates page tables for the guard range, so its cost scales with the randomized guard size, while a PROT_NONE reservation populates no page tables at all. Measured: about 2% slower for single-threaded churn with pages touched, 26% slower for pure alloc/free of 256 KiB allocations, about 2x slower in an 8-thread 256 KiB churn stress test (medians of 9 interleaved runs), and several times slower when churning allocations in the tens of MiB, where guards span thousands of pages. Measured TLB shootdown IPIs are slightly lower than with the PROT_NONE scheme, so the cost is in-kernel page-table work rather than interrupt traffic. The win is in whole-process operations that scale with VMA count, which is the actual motivation (see below). Hence opt-in rather than a default behavior change, following the CONFIG_LABEL_MEMORY precedent of a compile-time option defaulting to false.

Measurements

Linux 6.17 x86_64, 8 cores. 2000 concurrently-live 256 KiB allocations, all pages touched:

| metric | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| VMAs | +4007 | +9 |
| VmRSS | +512228 KiB | +512216 KiB |
| VmSize (RLIMIT_AS) | +782448 KiB | +775368 KiB |
| VmData (committed) | +512160 KiB | +775528 KiB |
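From the table, the extra commit is 775528 - 512160 = 263368 KiB, about 257 MiB, or roughly 132 KiB of guard per allocation across the 2000 allocations, which matches the ~260 MiB figure above. One way such deltas could be collected, assuming they are sampled from `/proc/self/status` before and after the allocation loop (the PR does not say how they were gathered), is a small reader like this hypothetical `status_kib` helper:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value in KiB of a /proc/self/status field
 * such as "VmData", "VmRSS" or "VmSize", or -1 if it cannot be read.
 */
static long status_kib(const char *field) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        return -1;
    }
    char line[256];
    size_t n = strlen(field);
    long value = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "VmData:     1234 kB". */
        if (strncmp(line, field, n) == 0 && line[n] == ':') {
            if (sscanf(line + n + 1, "%ld", &value) != 1) {
                value = -1;
            }
            break;
        }
    }
    fclose(f);
    return value;
}
```

Calling `status_kib("VmData")` before and after the workload and subtracting gives a delta comparable to the last row of the table.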

Adjacent single-VMA allocations merge, so it does better than 1 VMA/allocation. VMAs after sustained churn (2560 x 1 MiB alloc/free, full quarantine):

| config | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| default (no labeling) | +542 | +535 |
| `CONFIG_LABEL_MEMORY` | +3094 | +551 |

That's a ~5.6x reduction under CONFIG_LABEL_MEMORY (the Android default), and <= PROT_NONE in every config. (Counts vary with the randomized guard sizes.) Whole-process operations with 2000 live allocations: the /proc/self/smaps payload drops from 2844 KiB to 46 KiB, so code that does work per VMA scans far less, and VMA-dominated fork() latency drops ~32%.
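The VMA counts can be reproduced by counting lines in `/proc/self/maps`, since each line describes one VMA. The helper below is a sketch with a hypothetical name, and it assumes this is an acceptable proxy for how the numbers above were obtained:

```c
#include <stdio.h>

/* Hypothetical helper: number of VMAs in this process, or -1 on error. */
static long count_vmas(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return -1;
    }
    long count = 0;
    int c;
    /* Every VMA is exactly one newline-terminated line. */
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            count++;
        }
    }
    fclose(f);
    return count;
}
```

Sampling `count_vmas()` before and after a batch of large allocations gives deltas like the `+4007` versus `+9` reported above, with some run-to-run variation from guard size randomization and merging.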

Verification

Builds clean with -Werror under gcc and clang, feature off and on, with and without CONFIG_LABEL_MEMORY; the CI matrix (gcc, clang, musl) now also runs the test suite with the feature enabled. All 56 tests pass in every configuration.
Five new regression tests cover large-allocation underflow, aligned-allocation overflow/underflow, and the realloc-shrink guard and discarded-tail paths; they assert SIGSEGV under both guard schemes, on any kernel.
On a real 6.13+ kernel: guards fault on overflow, underflow, use-after-free (quarantine), and after in-place realloc shrink, with and without CONFIG_LABEL_MEMORY; quarantined, shrunk and mremap-grown regions stay single-VMA.
mlockall(MCL_FUTURE) latches the feature off via the probe and all allocations succeed on the PROT_NONE scheme; freeing an mlock'd allocation falls back for that call only.
Failure paths are exercised directly by injecting madvise faults with strace: with every call failing ENOMEM, failing from the 7th call onward, and failing EINVAL intermittently (forcing repeated re-probes and mixed-scheme allocations), the full suite passes and guards still fault through the fallbacks.
UBSan clean (suite + churn + realloc shrink/grow). A 2-minute 8-thread randomized stress (malloc/memalign/realloc/free with per-allocation pattern verification, plus mlock'd frees racing the probe) completes ~230k operations with no corruption; the only cross-thread state is the single atomic feature flag.
The probe trusts madvise's return value, so the feature must be validated on a real kernel: qemu-user silently no-ops MADV_GUARD_INSTALL, which would leave large allocations without guards. This is a reason it must stay opt-in.
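A check that an access really faults, rather than trusting a return value, can be sketched with a forked child. This is not the PR's test code: `dies_with_sigsegv` and `write_byte` are hypothetical names, and a plain PROT_NONE page stands in for a guard region so the sketch behaves the same on any kernel and with any allocator.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run fn(arg) in a forked child and report whether SIGSEGV killed it. */
static bool dies_with_sigsegv(void (*fn)(void *), void *arg) {
    pid_t pid = fork();
    if (pid < 0) {
        return false;
    }
    if (pid == 0) {
        fn(arg);
        _exit(0); /* the access did not fault */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}

static void write_byte(void *p) {
    *(volatile char *)p = 1;
}

/* Demo: page 0 is writable, page 1 is PROT_NONE and must fault. */
static bool prot_none_page_faults(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED || mprotect(p + page, page, PROT_NONE) != 0) {
        return false;
    }
    bool ok = !dies_with_sigsegv(write_byte, p) &&
              dies_with_sigsegv(write_byte, p + page);
    munmap(p, 2 * page);
    return ok;
}
```

Pointing `write_byte` at the byte just past a large allocation would exercise the real guard under either scheme, and on an emulator that no-ops the madvise the child would exit normally, exposing the missing guard.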

rdevshp · 2026-05-30T12:34:14Z

#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>

int main(void) {
    const size_t size = 256 * 1024;

    errno = 0;
    void *warm = malloc(size);
    if (warm == NULL) {
        printf("warmup_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 2;
    }
    printf("warmup_large_malloc=ok ptr=%p\n", warm);

    errno = 0;
    int lock_ret = mlockall(MCL_FUTURE | MCL_ONFAULT);
    printf("mlockall_mcl_future_ret=%d errno=%d (%s)\n", lock_ret, errno, strerror(errno));

    errno = 0;
    void *after = malloc(size);
    if (after == NULL) {
        printf("post_mlock_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 1;
    }

    printf("post_mlock_large_malloc=ok ptr=%p\n", after);
    return 0;
}

This PR causes a regression in the program above. With CONFIG_GUARD_PAGES_USE_MADVISE set to false the program runs normally, but with it set to true the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT) fails with errno=22 (EINVAL).

Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install the guard regions of large allocations with MADV_GUARD_INSTALL (Linux 6.13+) inside a single read-write mapping instead of as separate PROT_NONE mappings, keeping each large allocation to one VMA instead of three. The single-VMA property is preserved through allocate_pages(), allocate_pages_aligned(), the region quarantine and the in-place realloc shrink so it holds under allocation churn, including under CONFIG_LABEL_MEMORY where the quarantined region is named as a whole to avoid splitting the VMA.

Guard install zaps any existing pages in the range, so the quarantine still purges data and frees resident memory with a single system call, the same count as the PROT_NONE remap it replaces; allocation costs one extra system call (mmap + 2 madvise instead of mmap + mprotect).

Kernel support is probed on a fresh mapping at runtime and cached. Guard installation is best-effort: any madvise failure falls back to the PROT_NONE scheme. EINVAL means the specific mapping can't be guarded (VM_LOCKED), so it resets the cached state to force a re-probe: under mlockall(MCL_FUTURE) the probe mapping is itself locked and latches the feature off, while freeing a one-off mlock'd allocation only loses the single call. errno is preserved across the fallback.

It is off by default because the guard bytes and quarantined regions are then accounted as committed memory (resident memory and total address space are unchanged), which regresses strict overcommit (vm.overcommit_memory=2).

Add large allocation guard regression tests covering underflow, aligned allocation overflow/underflow and the in-place realloc shrink paths, which apply to both guard schemes, and build the new configuration in CI.

thomasbuilds · 2026-06-10T11:22:32Z

Thanks @rdevshp, the PR got updated quite a lot since your last review.

thomasbuilds marked this pull request as draft May 30, 2026 19:21

thomasbuilds force-pushed the madvise-guard-install branch from 9e3e3a6 to f54ee16 Compare May 30, 2026 19:43

thomasbuilds marked this pull request as ready for review June 6, 2026 08:36

thomasbuilds force-pushed the madvise-guard-install branch 2 times, most recently from ded5838 to 35a0009 Compare June 6, 2026 12:08

thomasbuilds force-pushed the madvise-guard-install branch from 35a0009 to 6252879 Compare June 10, 2026 10:34

thomasbuilds wants to merge 1 commit into GrapheneOS:main from thomasbuilds:madvise-guard-install
